A Robust and Extensible Tool for Data Integration Using Data Type Models
Authors
Abstract
Integrating heterogeneous data sets has been a significant barrier to many analytics tasks, because raw data sets vary in structure and cleanliness and therefore require one-off ETL code. We propose HiperFuse, which automates much of the data integration process by providing a declarative interface, robust type inference, extensible domain-specific data models, and a data integration planner that optimizes for plan completion time. The tool is designed for schema-less data querying, code reuse within specific domains, and robustness in the face of messy unstructured data. To demonstrate the tool and its reference implementation, we show the requirements and execution steps for a use case in which IP addresses from a web clickstream log are joined with census data to obtain the average income for particular site visitors (IPs), and we offer preliminary performance results and qualitative comparisons to existing data integration and ETL tools.
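The use case in the abstract (clickstream IPs joined with census data) can be illustrated with a minimal sketch. This is not HiperFuse's interface or implementation, which the abstract does not show; it is a plain-Python illustration of the same idea under stated assumptions. The sample rows, the `GEO` prefix-to-ZIP table, the `CENSUS` income figures, and all function names are hypothetical.

```python
import ipaddress
from statistics import mean

# Hypothetical sample data; none of these values come from the paper.
CLICKSTREAM = [
    {"visitor": "192.0.2.17", "page": "/home"},
    {"visitor": "198.51.100.4", "page": "/pricing"},
    {"visitor": "not-an-ip", "page": "/home"},  # messy row that fails type inference
    {"visitor": "192.0.2.200", "page": "/about"},
]
GEO = {"192.0.2.0/24": "10001", "198.51.100.0/24": "94105"}  # assumed IP prefix -> ZIP
CENSUS = {"10001": 81000, "94105": 152000}  # assumed ZIP -> average income


def infer_ip(value):
    """Return a parsed IP address, or None if the value is not an IP."""
    try:
        return ipaddress.ip_address(value.strip())
    except ValueError:
        return None


def ip_to_zip(ip, geo=GEO):
    """Map an IP address to a ZIP code via the first matching prefix."""
    for cidr, zip_code in geo.items():
        if ip in ipaddress.ip_network(cidr):
            return zip_code
    return None


def income_by_visitor(rows, geo=GEO, census=CENSUS):
    """Join clickstream rows to census income through the IP -> ZIP step."""
    result = {}
    for row in rows:
        ip = infer_ip(row["visitor"])
        if ip is None:
            continue  # skip rows whose visitor field is not an IP
        zip_code = ip_to_zip(ip, geo)
        if zip_code in census:
            result[str(ip)] = census[zip_code]
    return result


incomes = income_by_visitor(CLICKSTREAM)
average_income = mean(incomes.values())
```

The sketch hard-codes the join path (IP, then ZIP, then income) that a planner such as the one described above would presumably choose on its own; the malformed row is dropped rather than aborting the join, which is one simple reading of robustness to messy data.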
Similar resources
A Fully Integrated Method for Dynamic Rock Type Characterization Development in One of Iranian Off-Shore Oil Reservoir
Rock selection in modeling and simulation studies is usually based on two techniques: routinely defined rock types and those defined by special core analysis (SCAL). The challenge in utilizing these two techniques is that they are frequently assumed to be the same, but in practice, static rock-types (routinely defined) are not always representative of dynamic rock-types (SCAL defined) in the re...
Comparison of ordinary logistic regression and robust logistic regression models in modeling of pre-diabetes risk factors
Background: Given the increased risk of developing type 2 diabetes in pre-diabetic people, identifying pre-diabetes and determining its risk factors is necessary. This study aims to compare ordinary logistic regression and robust logistic regression models in modeling pre-diabetes risk factors. Methods: This cross-sectional study was conducted on 6460 people, over ...
Geostatistical estimation to delineate oxide and sulfide zones using geophysical data; a case study of Chahar Bakhshi vein-type gold deposit, NE Iran
Delineation of oxide and sulfide zones in mineral deposits, especially in gold deposits, is one of the most essential steps in an exploration project that has been traditionally carried out using the drilling results. Since in most mineral exploration projects there is a limited drilling dataset, application of geophysical data can reduce the error in delineation of the sulfide and oxide zones....
Robust Portfolio Optimization with risk measure CVAR under MGH distribution in DEA models
Financial returns exhibit stylized facts such as leptokurtosis, skewness and heavy-tailedness. Regarding this behavior, in this paper, we apply the multivariate generalized hyperbolic (mGH) distribution for portfolio modeling and performance evaluation, using conditional value at risk (CVaR) as a risk measure and allocating best weights for portfolio selection. Moreover, a robust portfolio optimizati...
The eXtensible ontology development (XOD) principles and tool implementation to support ontology interoperability
Ontologies are critical to data/metadata and knowledge standardization, sharing, and analysis. With hundreds of biological and biomedical ontologies developed, it has become critical to ensure ontology interoperability and the usage of interoperable ontologies for standardized data representation and integration. The suite of web-based Ontoanimal tools (e.g., Ontofox, Ontorat, and Ontobee) supp...